How it works
Tau is a time-series database for recording how values change over time. It is not a general-purpose relational store: no rows, no tables, no indexes. Only temporal intervals, a query language built around them, and a storage model that makes correction cheap.
This document describes why Tau is built the way it is, not just how it works. Decisions that might look odd from the outside have reasons. Knowing the reasons lets you contribute without fighting the grain of the design.
The Problem
Most databases assume a row represents current truth. To record history you add timestamps, but the model stays mutation-oriented: an update replaces the old value.
Tau starts from the opposite assumption. Every fact has a time range over which it was true. A measurement saying "temperature was 22 °C from noon to 1 pm" is a first-class value, not a derived view. Updating it means appending a correction: a new layer covering some or all of the same range with a newer value. The old layer is never touched.
This makes Tau correct by default for append-only workloads:
- Sensor streams where values arrive out of order
- Financial time series where prices are restated
- Audit trails where the history of corrections is itself interesting
The cost is that every query must resolve which layer wins at each point in time. That resolution logic is the sweep-line compaction algorithm and the layered query model.
Architecture
Tau is structured as a library (libtau) consumed by three binaries: the TCP server (tau), the interactive REPL (ctl), and the deterministic simulation tester (dst). The library exposes a clean Executor API; auth, TLS, and network concerns live exclusively in the server.
Stmt → Executor → Database<Value> → Store<V> + optional Wal
Primitives
Tau<V>
An atomic temporal fact: value V is true over the half-open interval [start, end).
Tau { start: i64, end: i64, value: V }
The half-open interval is intentional. Adjacent intervals tile cleanly: [0, 10) and [10, 20) cover [0, 20) with no overlap and no gap. Equality on the boundary belongs unambiguously to the later interval.
Tau::new asserts start < end. There are no zero-width taus.
Timestamps are i64 (nanoseconds, milliseconds, or any other unit the caller agrees on). Tau treats them as opaque ordered integers.
Layer<V>
A batch of taus that arrived together: a sorted, non-overlapping Arc<[Tau<V>]>.
Layer { id: u64, min_start: i64, max_end: i64, taus: Arc<[Tau<V>]> }
Layers are immutable once created. Cloning a layer is an atomic reference-count bump.
min_start and max_end are skip-check bounds. A point query for timestamp t can skip an entire layer with two comparisons (t < min_start || t >= max_end) before touching the data.
Within the slice, a binary search (partition_point on tau.end <= t) locates the candidate in O(log n).
Lens<V>
A named temporal function: either a Base lens backed by a store, or a Derived lens backed by a lazy closure.
Lens::Base -- delegates to the store layer stack
Lens::Derived(f) -- f: Arc<dyn Fn(Timestamp) -> Option<V>>
Derived lenses are closures compiled at DERIVE time. At query time they call f(t). The closure captures references to other lenses so derivations chain: DERIVE c AS a + b compiles into a closure that calls the closures of a and b.
Cycle detection runs at DERIVE time by walking the dependency graph.
Storage
Backends
InMemory: a HashMap<name, Vec<Layer<V>>> with no I/O. Used for tests and ephemeral workloads.
Disk: a binary file:
magic (3 bytes) "TAU" (plaintext) or "TAUE" (encrypted)
[entries...]
length (u32 LE)
crc32 (u32 LE) or nonce+tag (encrypted)
payload bytes
On open, the file is read entry by entry, integrity-checked, and replayed into the in-memory layer stack. The file handle stays open in append mode.
Encryption is AES-256-GCM with a random 12-byte nonce per entry. The key is never stored; it must be supplied via TAU_ENCRYPTION_KEY at startup. The TAUE magic prevents accidentally opening an encrypted file without a key.
Write-Ahead Log
The WAL sits between the caller and the store. Every mutation writes to the WAL first, fsyncs, then writes to the store. On startup the WAL is replayed before the store is opened.
WAL entries are line-oriented text:
<crc32hex> <base64-payload> -- data entry
S:<checksum> <CREATE LENS ...> -- schema DDL
SE:<crc32hex> <base64-encrypted> -- encrypted schema DDL
Schema entries carry the raw TauQL text of CREATE LENS and DERIVE LENS. On replay, these are re-parsed and executed with in_replay = true, which suppresses re-appending them to the WAL.
The WAL is checkpointed after compaction: a fresh snapshot of in-memory state is written to a new file and swapped in, bounding disk usage.
Compaction
Each base lens accumulates layers over time. A point query must walk layers newest-first until it finds a covering tau. With many layers this is linear in the layer count.
Auto-compaction fires when a lens exceeds a threshold (default: 4 layers). It runs a sweep-line algorithm over all layers:
- Build a list of start/end events, one pair per tau across all layers.
- Sort events by timestamp; ends before starts at ties.
- Walk events. A max-heap keyed by
(layer_idx, tau_idx)tracks which layers are active at each point. The layer with the highest index (newest) wins. - Emit a merged segment whenever the winning value changes.
This is O(E log E) where E is the total number of taus. After compaction, the lens has exactly one layer.
Store::append returns a bool indicating whether compaction fired. The Database layer uses this to decide whether to WAL-checkpoint.
Database and Executor
Database<V>
Database<V> owns a Store<V> behind RwLock and an optional Wal behind Mutex. Append order is always:
WAL.write(entry) → WAL.fsync() → Store.append(layer)
A WAL fsync failure leaves the in-memory store unchanged; the entry is not committed. No partial-write window is visible to readers.
Executor
Executor is the top-level query processor. It owns a HashMap<String, DbState> of named databases, an active-database pointer, and a UserStore.
Each DbState carries:
- A
Database<Value>(live store + WAL) - A
HashMap<name, Type>for base lens declarations - A
HashMap<name, Expr>for derived lens ASTs - A monotonic
next_layer_idcounter
Two entry-point pairs:
exec / exec_read: unrestricted. Used by library consumers, tests, and schema replay (in_replay = true prevents DDL from being re-appended).
exec_as(stmt, caller) / exec_read_as(stmt, caller): looks up caller in self.users, calls check_permission, then delegates. Used by the TCP server for every authenticated session. SHOW DATABASES is post-filtered to only databases the caller holds any grant on.
The split is intentional: embedding Tau as a library bypasses auth entirely. Auth is a server concern, not an engine concern.
Query Language
TauQL is a line-oriented command language: one statement in, one response line out. The grammar is minimal: no implicit join, no subquery, no transaction syntax.
The parser is a nom combinator in libtau::ql::parser. Adding a new statement requires changes to four files: ast.rs (new variant + Display), parser.rs (production + alt entry), executor.rs (handler + check_permission arm), and bin/tau/main.rs (output formatter).
Operator precedence from low to high: ||, &&, comparison, additive, multiplicative, unary, primary.
Server
Protocol
Line-oriented text: one TauQL statement per line in, one response line out.
Concurrency
Each accepted connection runs on its own OS thread. All threads share one Arc<RwLock<Executor>>. Read-only statements take the read lock and run concurrently. Write statements take the exclusive write lock.
This is simple and correct. The tradeoff: a slow write (e.g. WAL fsync on a slow disk) blocks all concurrent reads. For write-heavy workloads this can be a bottleneck; a per-database lock would improve write-read concurrency but adds complexity.
Connection capacity
--max-connections N (default 1024) caps concurrent client threads. Connections beyond the cap receive ERR server at connection limit. The accept loop tracks in-flight work with a single AtomicUsize.
--idle-timeout-secs SECS (default 300) installs a per-socket read/write timeout.
Design Decisions
Immutable layers over in-place mutation
Mutation would require finding and splitting or replacing existing taus. With immutable layers, a correction is an append. The WAL writes one new entry. The in-memory state gains one new layer. The old data is untouched.
The tradeoff is query cost: O(log n) per layer rather than O(log n) total. Compaction restores O(log n) by collapsing layers.
No transaction syntax
Tau does not expose transactions. The WAL provides durability; compaction provides space reclamation. Both happen automatically. Multi-statement atomicity would require locking, MVCC, or 2PC, none of which fit the single-writer append-only model cheaply.
If multiple appends must land atomically, batch them: APPEND LENS x 0 10 1, 10 20 2 is a single layer and a single WAL entry.
exec vs exec_as split
Auth is a transport concern. A Tau binary embedded in another process, reading sensor data directly, has no need for network authentication. Keeping the auth check out of exec means embedded use never pays the overhead or requires a dummy user.
Arc-backed layers
Layer data is immutable once created. Sharing it across the read path without copying is safe. A compacted layer replaces the old stack atomically; concurrent readers holding references to old layers read consistent data until they finish.
WAL-first ordering
Writing to the WAL before writing to the store means a crash between the two leaves an entry that replays on next startup, completing the write. The only risk is a duplicate replay, which the idempotent append semantics handle: replaying a layer that already exists adds it again, but the query result is identical (newest-layer-wins picks the same value either way).
Thread-per-connection instead of async
The server uses std::thread rather than Tokio or async-std. For a database server where each connection is long-lived and query processing is CPU-bound (compaction, expression evaluation) rather than I/O-bound, synchronous threads are simpler to reason about. Async would complicate the RwLock usage without a clear throughput gain for the expected connection counts.
Two modes in dst
dst serves as both the correctness tester and the performance measurement tool. In full mode it spawns real server processes across all config combinations. In embedded mode (--quick) it uses the library executor directly, covering centuries of simulated time without I/O overhead. The same seed printed at startup makes any failure reproducible from a single flag.